70 research outputs found

    Improving utilization of heterogeneous clusters

    Datacenters often aggregate sets of nodes with different capabilities, leading to suboptimal resource utilization. One of the best ways of improving utilization is to balance the load by taking into account the heterogeneity of these clusters. This article presents a novel way of expressing computational capacity, better suited to heterogeneous clusters, and also advocates task migration to further improve utilization. The experimental evaluation shows that both proposals are advantageous and allow improving the utilization of heterogeneous clusters and reducing the makespan to 16.7% and 17.1%, respectively. This work has been supported by the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and TIN2016-81840-REDT (CAPAP-H6 network) and the European HiPEAC Network of Excellence

    Extending OmpSs for OpenCL kernel co-execution in heterogeneous systems

    © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Heterogeneous systems offer very high potential performance but are difficult to program. OmpSs is a well-known framework for task-based parallel applications and an interesting tool for simplifying the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that addresses two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load balancing algorithm or by the well-known Static and Dynamic algorithms. All this is accomplished with negligible impact on the programming. Experimental results reveal that there is always one load balancing algorithm that improves the performance and energy consumption of the system. This work has been supported by the University of Cantabria with grant CVE-2014-18166, the Generalitat de Catalunya under grant 2014-SGR-1051, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2016-76635-C2-2-R (AEI/FEDER, UE) and TIN2015-65316-P. The Spanish Government through the Programa Severo Ochoa (SEV-2015-0493). The European Research Council under grant agreement No 321253 European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc Projects, grant agreements No 288777, 610402 and 671697 and the European HiPEAC Network. Peer reviewed. Postprint (published version).
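
The abstract above names a Dynamic load balancing option without describing it. As a rough illustration only, the sketch below simulates one common reading of dynamic balancing: the dataset is cut into fixed-size packets and each device takes the next packet as soon as it becomes idle. The function name `simulate_dynamic`, the packet size and the device throughputs are hypothetical and are not taken from the paper or from OmpSs.

```python
import heapq


def simulate_dynamic(total_items, packet_size, speeds):
    """Simulate fixed-size packet scheduling across devices.

    speeds[i] is a hypothetical throughput of device i in items per second.
    Returns (items_processed_per_device, makespan_in_seconds).
    """
    done = [0] * len(speeds)
    finish = [0.0] * len(speeds)
    # Min-heap of (time at which the device becomes idle, device index).
    idle = [(0.0, dev) for dev in range(len(speeds))]
    heapq.heapify(idle)
    remaining = total_items
    while remaining > 0:
        now, dev = heapq.heappop(idle)       # earliest idle device
        packet = min(packet_size, remaining)  # last packet may be smaller
        remaining -= packet
        done[dev] += packet
        finish[dev] = now + packet / speeds[dev]
        heapq.heappush(idle, (finish[dev], dev))
    return done, max(finish)


# Hypothetical two-device system: the second device is three times faster.
items, makespan = simulate_dynamic(400, 100, [1.0, 3.0])
print(items, makespan)
```

In this toy run the faster device ends up processing three packets for every one handled by the slower device, without the scheduler knowing the speeds in advance. That is the appeal of a dynamic scheme; its cost in a real system is the per-packet synchronisation overhead, which this simulation ignores.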

    Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

    A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of heterogeneous systems. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically reducing response time and energy consumption. It is designed around several features: it is dynamic, adaptive, guided and effortless, as it does not require the user to supply any parameter and adapts to the behaviour of each kernel at runtime. To evaluate Sigmoid's performance, it has been implemented in Maat, a system abstraction library. Experimental results with different kernel types show that Sigmoid exhibits excellent performance, reaching a utilization of 90%, together with energy savings of up to 20%, always reducing programming effort compared to OpenCL, and facilitating portability to other heterogeneous machines. This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence
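
To make "splits the workload proportionally to the capabilities of the devices" concrete, here is a minimal sketch of a one-shot proportional split. It is not the Sigmoid algorithm, which the abstract describes as dynamic and adaptive at runtime. The function name `proportional_split` and the relative speeds are assumptions chosen for illustration.

```python
def proportional_split(total_items, speeds):
    """Split total_items among devices in proportion to their relative speeds.

    Uses largest-remainder rounding so the chunk sizes are integers that
    always add up to total_items.
    """
    total_speed = sum(speeds)
    exact = [total_items * s / total_speed for s in speeds]
    chunks = [int(x) for x in exact]  # round every share down first
    leftover = total_items - sum(chunks)
    # Hand the leftover items to the devices with the largest fractional part.
    by_remainder = sorted(range(len(speeds)),
                          key=lambda i: exact[i] - chunks[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        chunks[i] += 1
    return chunks


# Hypothetical relative speeds for a CPU, an integrated GPU and a discrete GPU.
print(proportional_split(1000, [1, 3, 6]))
```

A static split like this only balances well when the relative speeds are known beforehand and stay constant during the kernel, which is the limitation that adaptive approaches aim to remove.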

    To distribute or not to distribute: The question of load balancing for performance or energy

    Heterogeneous systems are nowadays a common choice on the path to Exascale. Through the use of accelerators they offer outstanding energy efficiency. The programming of these devices employs the host-device model, which is suboptimal because the CPU remains idle during kernel executions but still consumes energy. Making the CPU contribute computing effort might improve the performance and energy consumption of the system. This paper analyses the advantages of this approach and sets the limits of when it is beneficial. The claims are supported by a set of models that determine how to share a single data-parallel task between the CPU and the accelerator for optimum performance, energy consumption or efficiency. Interestingly, the models show that optimising performance does not always yield optimum energy or efficiency as well. The paper experimentally validates the models, which represent an invaluable tool for programmers when faced with the dilemma of whether to distribute their workload in these systems. This work has been supported by the University of Cantabria (CVE-2014-18166), the Spanish Science and Technology Commission (TIN2016-76635-C2-2-R), the European Research Council (G.A. No 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671697. Peer reviewed. Postprint (author's final draft).
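
The claim that the fastest work split is not necessarily the most energy-efficient one can be illustrated with a toy model. The sketch below is not the model from the paper: it assumes a unit of work, constant device throughputs, and constant busy and idle power per device, with every number made up for the example.

```python
def time_and_energy(alpha, s_cpu, s_acc, p_cpu, p_acc):
    """Toy model of sharing one unit of work between a CPU and an accelerator.

    alpha        fraction of the work given to the CPU (0..1)
    s_cpu, s_acc throughput of each device in work units per second
    p_cpu, p_acc (busy_watts, idle_watts) for each device
    Returns (execution_time_seconds, energy_joules).
    """
    t_cpu = alpha / s_cpu
    t_acc = (1.0 - alpha) / s_acc
    total = max(t_cpu, t_acc)  # the run ends when the slower side finishes
    energy = (p_cpu[0] * t_cpu + p_cpu[1] * (total - t_cpu)
              + p_acc[0] * t_acc + p_acc[1] * (total - t_acc))
    return total, energy


# Hypothetical machine: the accelerator is 4x faster than the CPU.
S_CPU, S_ACC = 1.0, 4.0
P_CPU, P_ACC = (100.0, 20.0), (150.0, 30.0)

ALPHAS = [i / 100 for i in range(101)]
best_time_alpha = min(
    ALPHAS, key=lambda a: time_and_energy(a, S_CPU, S_ACC, P_CPU, P_ACC)[0])
best_energy_alpha = min(
    ALPHAS, key=lambda a: time_and_energy(a, S_CPU, S_ACC, P_CPU, P_ACC)[1])
print(best_time_alpha, best_energy_alpha)
```

With these invented parameters the shortest execution time is reached when the CPU takes a share equal to its fraction of the combined throughput, while the lowest energy is obtained by leaving the CPU idle. Different power figures would move the energy optimum, which is why a model is needed to decide case by case.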

    A system for distance teaching in courses with real hardware

    Practical laboratory teaching in hardware-centred courses, such as those in the area of Computer Structure and Organization, has been severely affected by COVID-19. This article introduces a new remote laboratory system for practical sessions based on Raspberry Pi boards running the RISC OS operating system. The system manages both the power supply of the machines and the input/output performed through peripheral devices, and allows the student to view and interact with the desktop of the remote machine and with the hardware devices connected to it. The system also lets a student and a teacher view the remote machine simultaneously in real time, which makes it easier to resolve questions and to carry out assessment tests. The system combines control logic based on Arduino modules and Ethernet connections with a web interface written in PHP. With these specifications, a proof of concept with two remote machines and two input interfaces has been successfully developed. The authors thank Fernando Vallejo, Carmen Martínez and Cristóbal Camarero for their collaboration. This work has been partially funded by the V Convocatoria de Proyectos de Innovación Docente of the Vicerrectorado de Ordenación Académica y Profesorado of the Universidad de Cantabria

    Performance and energy consumption of HPC workloads on a cluster based on Arm ThunderX2 CPU

    In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc 3. This system, called Dibona, has been integrated by ATOS/Bull and is powered by Marvell's latest CPU, the ThunderX2. This CPU is the same one that powers the Astra supercomputer, the first Arm-based supercomputer to enter the Top500, in November 2018. Our study ranges from micro-benchmarks to large production codes. We include an interdisciplinary evaluation of three scientific applications (a finite-element fluid dynamics code, a smoothed particle hydrodynamics code, and a lattice Boltzmann code) and the Graph 500 benchmark, focusing on parallel and energy efficiency as well as studying their scalability up to thousands of Armv8 cores. For comparison, we run the same tests on state-of-the-art x86 nodes included in Dibona and on the Tier-0 supercomputer MareNostrum4. Our experiments show that the ThunderX2 has a 25% lower performance on average, mainly due to its small vector unit, yet this is somewhat compensated by its 30% wider links between the CPU and the main memory. We found that the software ecosystem of the Armv8 architecture is comparable to the one available for Intel. Our results also show that the ThunderX2 delivers similar or better energy-to-solution and scalability, proving that Arm-based chips are legitimate contenders in the market of next-generation HPC systems.

    Auto-tuned OpenCL kernel co-execution in OmpSs for heterogeneous systems

    The emergence of heterogeneous systems has been very notable recently. The nodes of the most powerful computers integrate several compute accelerators, like GPUs. Profiting from such node configurations is not a trivial endeavour. OmpSs is a framework for task-based parallel applications that allows the execution of OpenCL kernels on different compute devices. However, it does not support the co-execution of a single kernel on several devices. This paper presents an extension of OmpSs that rises to this challenge, and presents Auto-Tune, a load balancing algorithm that automatically adjusts its internal parameters to suit the hardware capabilities and application behavior. The extension allows programmers to take full advantage of the computing devices with negligible impact on the code. It takes care of two main issues. First, the automatic distribution of datasets and the management of device memory address spaces. Second, the implementation of a set of load balancing algorithms to adapt to the particularities of applications and systems. Experimental results reveal that the co-execution of single kernels on all the devices in the node is beneficial in terms of performance and energy consumption, and that Auto-Tune gives the best overall results. This work has been supported by the University of Cantabria with grant CVE-2014-18166, the Generalitat de Catalunya under grant 2014-SGR-1051, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2016-76635-C2-2-R (AEI/FEDER, UE) and TIN2015-65316-P. The Spanish Government through the Programa Severo Ochoa (SEV-2015-0493)

    Assessing the Suitability of King Topologies for Interconnection Networks

    In recent years many different interconnection networks have been used, with two main tendencies. One is characterized by the use of high-degree routers with long wires, while the other uses routers of much smaller degree. The latter relies on two-dimensional mesh and torus topologies with shorter local links. This paper focuses on doubling the degree of common 2D meshes and tori while still preserving an attractive layout for VLSI design. By adding a set of diagonal links in one direction, diagonal networks are obtained. By adding a second set of links, networks of degree eight are built, named king networks. This research presents a comprehensive study of these networks which includes a topological analysis, the proposal of appropriate routing procedures and an empirical evaluation. King networks exhibit a number of attractive characteristics which translate to reduced execution times of parallel applications. For example, the execution times of the NPB suite are reduced by up to 30 percent. In addition, this work reveals other properties of king networks, such as perfect partitioning, which deserve further attention for their convenient exploitation in forthcoming high-performance parallel systems.
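
The construction described above (a 2D torus plus both sets of diagonal links, giving degree eight) can be sketched in a few lines. The code assumes a square n x n torus with n >= 3 and wrap-around on both axes. The function names are illustrative, and the minimal-hop formula is the standard Chebyshev distance of king moves applied per ring, not a routing procedure taken from the paper.

```python
def ring_distance(a, b, n):
    """Shortest distance between positions a and b on a ring of n nodes."""
    d = abs(a - b) % n
    return min(d, n - d)


def king_neighbors(x, y, n):
    """The eight neighbours of node (x, y) in an n x n king torus (n >= 3)."""
    return {((x + dx) % n, (y + dy) % n)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)}


def king_distance(src, dst, n):
    """Minimal hop count in a king torus: a diagonal hop advances both axes."""
    return max(ring_distance(src[0], dst[0], n),
               ring_distance(src[1], dst[1], n))


def torus_distance(src, dst, n):
    """Minimal hop count in a plain 2D torus, for comparison."""
    return (ring_distance(src[0], dst[0], n)
            + ring_distance(src[1], dst[1], n))


n = 8
print(len(king_neighbors(0, 0, n)))
print(torus_distance((0, 0), (3, 3), n), king_distance((0, 0), (3, 3), n))
```

For a purely diagonal destination such as (3, 3) the king torus needs half as many hops as the plain torus, which gives an intuition for why the extra links can reduce communication time.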

    Temperature Variations from HST Spectroscopy of the Orion Nebula

    We present HST/STIS long-slit spectroscopy of NGC 1976. Our goal is to measure the intrinsic line ratio [O III] 4364/5008 and thereby evaluate the electron temperature (T_e) and the fractional mean-square T_e variation (t_A^2) across the nebula. We also measure the intrinsic line ratio [N II] 5756/6585 in order to estimate T_e and t_A^2 in the N^+ region. The interpretation of the [N II] data is not as clear-cut as that of the [O III] data because of a higher sensitivity to knowledge of the electron density as well as a possible contribution to the [N II] 5756 emission by recombination (and cascading). We present results from binning the data along the various slits into tiles that are 0.5" square (matching the slit width). The average [O III] temperature for our four HST/STIS slits varies from 7678 K to 8358 K; t_A^2 varies from 0.00682 to at most 0.0176. For our preferred solution, the average [N II] temperature for each of the four slits varies from 9133 K to 10232 K; t_A^2 varies from 0.00584 to 0.0175. The measurements of T_e reported here are an average along each line of sight. Therefore, despite finding remarkably low t_A^2, we cannot rule out significantly larger temperature fluctuations along the line of sight. The result that the average [N II] T_e exceeds the average [O III] T_e confirms what has been previously found for Orion and what is expected on theoretical grounds. Observations of the proplyd P159-350 indicate: large associated local extinction; ionization stratification consistent with external ionization by theta^1 Ori C; and, indirectly, evidence of high electron density. Comment: MNRAS accepted: 30 pages, 3 Figures, 2 Tables

    Energy efficiency of load balancing for data-parallel applications in heterogeneous systems

    The use of heterogeneous systems in supercomputing is on the rise as they improve both performance and energy efficiency. However, the programming of these machines requires considerable effort to get the best results in massively data-parallel applications. Maat is a library that enables OpenCL programmers to efficiently execute single data-parallel kernels using all the available devices on a heterogeneous system. It offers a set of load balancing methods, which perform the data partitioning and distribution among the devices, exploiting more of the performance of the system and consequently reducing execution time. Until now, however, a study of the implications of these on the energy consumption has not been made. Therefore, this paper analyses the energy efficiency of the different load balancing methods compared to a baseline system that uses just a single GPU. To evaluate the impact of the heterogeneity of the system, the GPUs were set to different frequencies. The obtained results show that in all the studied cases there is at least one load balancing method that improves energy efficiency.